The Red Wine by Pradip More

Introduction to the Data Set

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). This analysis is carried out only for the red wine data.

Description of attributes:

  1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  2. volatile acidity: the amount of acetic acid in wine, which at excessively high levels can lead to an unpleasant, vinegar taste

  3. citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

  4. residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

  5. chlorides: the amount of salt in the wine

  6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  7. total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  8. density: the density of wine is close to that of water depending on the percent alcohol and sugar content

  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  10. sulfates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  11. alcohol: the percent alcohol content of the wine

Output variable (based on sensory data):

  1. quality (score between 0 and 10)

Source: Click.

Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## [1] 1599   13

There are 1599 observations and 13 columns. The column “X” is not required and can be deleted, which leaves 12 variables. Now let’s look at the first and last few observations and the structure of the data.
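A minimal sketch of this step in base R. The data frame name `redwine` is taken from the output further down; the frame itself is a toy stand-in built from the first six rows printed in this report (a few columns only), and treating “X” as a row index is an assumption.

```r
# Toy stand-in for the full data: the first six rows shown in this report,
# restricted to a few columns, plus the column "X" (assumed to be a row index).
redwine <- data.frame(
  X                = 1:6,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Drop the column that is not needed for the analysis.
redwine$X <- NULL

dim(redwine)      # number of rows and remaining columns
head(redwine, 3)  # first few observations
tail(redwine, 3)  # last few observations
str(redwine)      # column types
```

With the full data set the same calls would apply unchanged once the file has been read into `redwine`.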

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1594           6.8            0.620        0.08            1.9     0.068
## 1595           6.2            0.600        0.08            2.0     0.090
## 1596           5.9            0.550        0.10            2.2     0.062
## 1597           6.3            0.510        0.13            2.3     0.076
## 1598           5.9            0.645        0.12            2.0     0.075
## 1599           6.0            0.310        0.47            3.6     0.067
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 1594                  28                   38 0.99651 3.42      0.82
## 1595                  32                   44 0.99490 3.45      0.58
## 1596                  39                   51 0.99512 3.52      0.76
## 1597                  29                   40 0.99574 3.42      0.75
## 1598                  32                   44 0.99547 3.57      0.71
## 1599                  18                   42 0.99549 3.39      0.66
##      alcohol quality
## 1594     9.5       6
## 1595    10.5       5
## 1596    11.2       6
## 1597    11.0       6
## 1598    10.2       5
## 1599    11.0       6
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

All variables are numeric; quality is stored as an integer and the rest as doubles. There are no missing values in the data set. Now, let’s look at the summary of the data.
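One way to check for missing values is sketched below. The frame `toy` is hypothetical, and a single `NA` is planted on purpose so the check has something to report; the real data set contains none.

```r
# Small illustrative frame; one NA is planted deliberately.
toy <- data.frame(
  pH      = c(3.51, 3.20, NA, 3.16),
  alcohol = c(9.4, 9.8, 9.8, 9.8)
)

na_per_column <- colSums(is.na(toy))  # count of NAs in each column
na_per_column

anyNA(toy)                 # TRUE here because of the planted NA
complete <- na.omit(toy)   # keep only rows without any NA
summary(complete)          # min, quartiles, median, mean and max per column
```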

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
  1. There is not much variation in the pH and density values of the wines.
  2. There is a huge range of values in the free and total sulfur dioxide.
  3. Most of the wines are of average quality.

Most of the wine samples are of average quality, with ratings of 5 and 6. The quality variable can be treated as a categorical variable: the rating scale runs from 0 (worst) to 10 (best), although only ratings 3 to 8 occur in this data set. A new ordered categorical variable named “fquality” is created, as shown below.

## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ fquality            : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## # A tibble: 6 x 3
##   fquality     n proportion
##      <ord> <int>      <dbl>
## 1        3    10       0.01
## 2        4    53       0.03
## 3        5   681       0.43
## 4        6   638       0.40
## 5        7   199       0.12
## 6        8    18       0.01

About 82 % of the wines are of average quality [ratings 5 and 6]. About 4 % are of the worst quality [ratings 3 and 4], and about 14 % are of better quality [ratings 7 and 8]. Very few wines fall at either extreme of quality.
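The ordered factor and the proportion table can be reproduced along the following lines. This is a base R sketch, not the author's code (the tibble output above suggests dplyr was used); the quality vector is rebuilt from the counts in the table, and the helper `share` is hypothetical.

```r
# Rebuild the quality column from the counts reported in the table above.
counts  <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# Ordered factor: levels are the ratings actually present, in increasing order.
fquality <- factor(quality, ordered = TRUE)

n          <- table(fquality)
proportion <- round(prop.table(n), 2)
data.frame(fquality = names(n), n = as.vector(n),
           proportion = as.vector(proportion))

# Share of a group of ratings, in percent, rounded to one decimal.
share <- function(levels) round(100 * sum(n[levels]) / sum(n), 1)
share(c("5", "6"))  # average quality
share(c("3", "4"))  # worst quality
share(c("7", "8"))  # better quality
```

Computing the shares from the raw counts rather than from the rounded proportions avoids accumulating rounding error across groups.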


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.150   7.500   8.360   9.875  11.600 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.779   8.400  12.500 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.100   7.800   8.167   8.900  15.900 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.000   7.900   8.347   9.400  14.300 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.800   8.872  10.100  15.600 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.250   8.567  10.225  12.600

Most of the wine samples have a fixed acidity between 6 and 11 g/dm^3. The histogram is right skewed, with many outliers on the higher side. Because of the long right tail, the mean (8.32) of the samples is greater than the median (7.9). Both the median and the mean of fixed acidity are slightly higher for wines with quality ratings of 7 and 8.
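Summaries of this shape (an overall summary followed by one per quality level) can be produced with `summary` and `by`. The helper name `summarise_by_quality` is hypothetical, and the toy data are the twelve rows printed earlier in this report, so the numbers will not match the full-data output.

```r
# Hypothetical helper (not the author's code): overall summary of a
# variable, followed by one summary per quality level.
summarise_by_quality <- function(x, group) {
  list(summary(x), by(x, group, summary))
}

# Toy data: the twelve rows printed earlier in this report.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 6.8, 6.2, 5.9, 6.3, 5.9, 6.0)
quality       <- c(5, 5, 5, 6, 5, 5, 6, 5, 6, 6, 5, 6)
fquality      <- factor(quality, ordered = TRUE)

res <- summarise_by_quality(fixed.acidity, fquality)
res
```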

We log-transform the data to check whether the transformed values are approximately normally distributed.

The log-transformed data is fairly normal, with outliers on both sides of the distribution. The peak of the data occurs at around 7 g/dm^3.
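The effect of a log transformation on a right-skewed variable can be illustrated numerically. The sketch below uses simulated log-normal values rather than the wine data, and the `skewness` function is a plain moment-based definition written here because base R does not provide one.

```r
# Moment-based sample skewness; base R has no built-in for this.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Simulated right-skewed values standing in for a measured variable.
set.seed(1)
x <- rlnorm(1599, meanlog = log(8), sdlog = 0.5)

skew_raw <- skewness(x)         # positive: long right tail
skew_log <- skewness(log10(x))  # much closer to zero after the transform

c(raw = skew_raw, log10 = skew_log)
c(mean = mean(x), median = median(x))  # mean exceeds median under right skew
```

A skewness near zero after the transform is consistent with a roughly symmetric shape, but it is not a formal test of normality.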


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

The distribution of volatile acidity is right (positively) skewed. As the quality of the wine increases, the mean and median of the volatile acidity generally decrease. As noted in the variable description, excessively high levels of volatile acidity can lead to an unpleasant, vinegar taste. Let’s transform the data to check whether it is approximately normal.

We can see that most of the samples lie in the range 0.3 to 0.8. The best quality wines have a volatile acidity of around 0.37 to 0.4. The distribution is fairly normal, with a few outliers on both sides.


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

Citric acid is used to add flavour and ‘freshness’. Its distribution is long tailed, positively skewed and has multiple modes. Most of the observations fall between 0 and 0.5. The best quality wines have higher citric acid levels (mean and median) of around 0.4. It will be interesting to see the bivariate relationship with the quality of wines.


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400

The distribution is highly right skewed, with the peak occurring at around 2 g/dm^3. Most of the samples have residual sugar of around 1 to 3 g/dm^3. Residual sugar is more or less constant across the different wine qualities.

Even after the log transformation the distribution remains non-normal and positively skewed.


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0610  0.0790  0.0905  0.1225  0.1430  0.2670 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600

The distribution of chlorides is highly skewed, with a lot of outliers on the higher side. The median of chlorides is lower for wines with a quality of 7 and 8.

The log-transformed distribution is fairly normal, with outliers on both sides.


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     6.0    11.0    14.5    34.0 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   12.26   15.00   41.00 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   15.00   16.98   23.00   68.00 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   14.00   15.71   21.00   72.00 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   14.05   18.00   54.00 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00    7.50   13.28   16.50   42.00

Free sulfur dioxide has a long-tailed, positively skewed distribution with outliers on the higher side. The range of values is also wide. The average quality wines [ratings 5 and 6] have slightly higher levels of free sulfur dioxide, which prevents microbial growth and the oxidation of wine.

The log distribution is fairly normal.


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.50   27.00   35.02   43.00  289.00 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00

The distribution is highly skewed, with a huge range and outliers on the higher side. The mean of total sulfur dioxide is lower for wines with ratings of 7 and 8 than for average quality wines. There are two data points on the extreme right which need further investigation.

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1080           7.9              0.3        0.68            8.3      0.05
## 1082           7.9              0.3        0.68            8.3      0.05
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 1080                37.5                  278 0.99316 3.01      0.51
## 1082                37.5                  289 0.99316 3.01      0.51
##      alcohol quality fquality
## 1080    12.3       7        7
## 1082    12.3       7        7

We can see that all feature values are the same in both rows except total sulfur dioxide, which is unusually high. This could be a copy-paste or typing error. We delete these two extreme observations from the data set and check the distribution of total sulfur dioxide again.
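A sketch of how the two rows might be located and dropped. The toy frame holds the two flagged rows plus two ordinary ones, and the cut-off of 200 is an assumption chosen for illustration; any value between 165 (the largest remaining value) and 278 would select the same rows.

```r
# Toy frame: the two flagged rows from the output above plus two ordinary rows.
redwine <- data.frame(
  total.sulfur.dioxide = c(34, 278, 289, 67),
  alcohol              = c(9.4, 12.3, 12.3, 9.8),
  quality              = c(5L, 7L, 7L, 5L),
  row.names            = c("1", "1080", "1082", "2")
)

# The threshold of 200 is an illustrative assumption, not a documented rule.
extreme <- redwine$total.sulfur.dioxide > 200
redwine[extreme, ]                      # inspect the rows before deleting

cleaned <- redwine[!extreme, , drop = FALSE]
summary(cleaned$total.sulfur.dioxide)
```

Keeping the filtered result in a separate object (`cleaned`) leaves the original rows available in case the extreme values turn out to be genuine.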

## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.17   62.00  165.00 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.0    17.0    27.0    32.5    43.0   106.0 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00

The distribution looks fairly normal with no outliers.


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9968  0.9978  1.0037 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9947  0.9961  0.9976  0.9975  0.9988  1.0008 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9957  0.9965  0.9965  0.9974  1.0010 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9926  0.9962  0.9970  0.9971  0.9979  1.0031 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9954  0.9966  0.9966  0.9979  1.0037 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9948  0.9959  0.9961  0.9974  1.0032 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9908  0.9942  0.9949  0.9952  0.9972  0.9988

The distribution of density is normal, with the mean, median and mode occurring at around 0.997. The density is almost constant across the different wine qualities.


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.160   3.312   3.390   3.398   3.495   3.630 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.370   3.382   3.500   3.900 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.290   3.294   3.380   3.780 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.163   3.230   3.267   3.350   3.720

The distribution of pH is fairly normal, with the mean, median and mode occurring at about 3.3. There are outliers on both sides of the distribution. pH is slightly lower for wines with a quality of 7 and 8.


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6583  0.7300  2.0000 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7436  0.8300  1.3600 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

The distribution of sulfates is long tailed and positively skewed, with the peak occurring at around 0.6. There are many outliers in the data. Sulfates are slightly higher for higher quality wines, so they might have an impact on the quality of wines.

The distribution looks fairly normal, with outliers on the right side.


## [[1]]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90 
## 
## [[2]]
## redwine$fquality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## redwine$fquality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## redwine$fquality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## redwine$fquality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## redwine$fquality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.46   12.10   14.00 
## -------------------------------------------------------- 
## redwine$fquality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

The distribution of alcohol is long tailed and positively skewed, with a few outliers on the higher side. Most of the values lie between 9 and 12.
The mean and median of alcohol are higher for wines with quality ratings of 7 and 8 than for lower rated wines. In fact, from rating 5 upward the quartiles, median and mean all increase with the quality of the wine. It looks like alcohol has a strong influence on the quality of the wine. It will be interesting to see the effect of alcohol coupled with other features of the data set on the quality of the wine. Let’s see the log distribution below.

The distribution looks fairly normal, with the peak occurring at 9.5.


Univariate Analysis

What is the structure of your dataset?

There are 1599 observations of 12 variables. All variables are numeric; the output variable, quality, is stored as an integer. The data is tidy, with no missing values.

What is/are the main feature(s) of interest in your dataset?

Alcohol, volatile acidity, fixed acidity, citric acid and pH are the main features of interest in the dataset.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Residual sugar, sulphates and total sulfur dioxide.

Did you create any new variables from existing variables in the dataset?

A new ordered categorical variable, fquality, was created from the integer quality variable.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

In total sulfur dioxide there are two outliers with extreme levels, while the values of the other variables are the same in both rows. I decided to delete these rows, assuming they are bad data rather than genuine extreme cases. They may be genuine extreme cases, but deleting two rows is unlikely to have any impact on the analysis. Since the data is tidy, no other operations were performed.

Bivariate Plots Section

Observations

  • Fixed acidity is positively correlated with citric acid and with the density of wines, and negatively correlated with pH. These are moderate correlations, with coefficients of around 0.7 in magnitude.

  • Volatile acidity is negatively correlated with citric acid.

  • Citric acid is negatively correlated with pH.

  • Residual sugar has no correlation with any other features.

  • Free sulfur dioxide has a moderate correlation with total sulfur dioxide.

  • Density is negatively correlated with alcohol.

  • Alcohol has a positive correlation with the quality of wines.

Now let’s look at the scatter plots and the corresponding correlation coefficients of the features of importance and interest mentioned above.

## [1] "Coorelation coefficient between: alcohol and quality is 0.47"

There is a weak positive correlation between alcohol and the quality of wines.

## [1] "Coorelation coefficient between: volatile.acidity and quality is -0.39"

There is a weak negative correlation between volatile acidity and the quality of wines.

## [1] "Coorelation coefficient between: fixed.acidity and citric.acid is 0.67"

Fixed acidity and citric acid have a moderate positive correlation.

## [1] "Coorelation coefficient between: fixed.acidity and density is 0.67"

Again, fixed acidity has a moderate positive correlation with the density of wines.

## [1] "Coorelation coefficient between: fixed.acidity and pH is -0.69"

Fixed acidity has a moderate negative correlation with the pH of wines.

## [1] "Coorelation coefficient between: citric.acid and volatile.acidity is -0.55"

Citric acid and volatile acidity are negatively correlated, and the relationship between them is fairly weak.

## [1] "Coorelation coefficient between: free.sulfur.dioxide and total.sulfur.dioxide is 0.67"

Since free sulfur dioxide is part of total sulfur dioxide, this relationship is along expected lines [moderate and positive correlation].

## [1] "Coorelation coefficient between: density and alcohol is -0.49"

Density and alcohol have a weak negative correlation.

## [1] "Coorelation coefficient between: citric.acid and pH is -0.54"

Similarly, citric acid and pH have a weak negative correlation.
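The coefficients quoted above are Pearson correlation coefficients. As a hedged illustration of how such a value is computed, here is a small self-contained Python sketch; the function name `pearson` and the sample numbers are assumptions for the example and are not taken from the data set or from the code behind this report.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Made-up illustrative values; NOT rows from the wine data set.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
quality = [5, 5, 6, 6, 7]
print("Correlation coefficient between alcohol and quality is",
      round(pearson(alcohol, quality), 2))
```

The result always lies between -1 and 1; the sign gives the direction of the relationship and the magnitude gives its strength.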

Wines with quality ratings of 7 and 8 have alcohol levels above 11. We can also see a slight trend: as the alcohol level increases, the quality of wines also increases. This can also be observed in the scatter plot of these features.

Wines with lower volatile acidity are of better quality. It will be interesting to see the combined effect of volatile acidity and alcohol in the multivariate analysis.

If we look at the medians of the box plots, we can see a clear-cut trend between the quality of wines and citric acid: quality increases as the level of citric acid increases.

Sulphates also have a positive impact on the quality of wines.

Better quality wines have slightly lower levels of pH.
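The box plot comparisons above rest on the median of each feature within each quality rating. A minimal Python sketch of that grouping step is shown below; `median_by_group` and the sample pairs are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict
from statistics import median


def median_by_group(pairs):
    """Return {group: median of values} for an iterable of (group, value) pairs."""
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)
    # Sort by group key so the ratings come out in ascending order.
    return {group: median(values) for group, values in sorted(groups.items())}


# Made-up (quality, citric acid) pairs; NOT values from the wine data set.
sample = [(5, 0.10), (5, 0.20), (5, 0.30), (6, 0.25), (7, 0.40), (7, 0.45), (7, 0.50)]
for rating, med in median_by_group(sample).items():
    print("quality", rating, "median citric acid", med)
```

Comparing these per-rating medians is the numeric counterpart of reading the centre lines of the box plots.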

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • Better quality wines have higher levels of alcohol, citric acid and sulphates and lower levels of pH and volatile acidity.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

No

What was the strongest relationship you found?

None of the relationships can be called strong, but the features listed in the correlation observations above are moderately related. The largest coefficient in magnitude is -0.69, between fixed acidity and pH.

Multivariate Plots Section

We can see a clear pattern emerging from the plot. The light orange dots corresponding to quality ratings of 5 and 6 are concentrated towards the left middle portion of the plot. This is the region of higher volatile acidity and lower alcohol levels. These are medium quality wines [quality ratings 5 and 6]. The blue and green dots corresponding to quality ratings of 7 and 8 are concentrated towards the lower middle portion of the graph. This is the region where alcohol is on the higher side and volatile acidity is on the lower side. This plot helps us differentiate average quality wines from better quality wines.

This is a different representation of the above multivariate scatter plot. The box plot has been created to stratify the volatile acidity data clearly, which treats volatile acidity as a categorical variable rather than a continuous one. In this plot I created three panels for different levels of volatile acidity. We can clearly see that for lower levels of volatile acidity and higher levels of alcohol, wines are of better quality. This plot further bolsters our understanding of the features of this data set. We can also see that for higher levels of volatile acidity there are no wines of better quality [wines with quality ratings of 7 and 8], as the box plots for those qualities are absent.
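Turning volatile acidity into three levels, as described above, amounts to binning a continuous value at two cut points. The Python sketch below shows one way to do this; the cut points 0.4 and 0.6, the labels, and the name `bin_label` are assumptions for illustration, since the report does not state the boundaries that were used.

```python
import bisect


def bin_label(value, cuts, labels):
    """Map a continuous value to a label using ascending cut points.

    There must be exactly one more label than cut points. A value equal to a
    cut point falls into the upper bin.
    """
    if len(labels) != len(cuts) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect.bisect_right(cuts, value)]


# Assumed cut points and labels, chosen only for this sketch.
CUTS = [0.4, 0.6]
LABELS = ["low", "medium", "high"]

# Made-up volatile acidity values; NOT taken from the wine data set.
for acidity in [0.28, 0.40, 0.55, 0.88]:
    print(acidity, "->", bin_label(acidity, CUTS, LABELS))
```

Each resulting label can then be used as the panel (facet) variable when drawing one box plot panel per level.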

We can see that the blue and green dots [quality ratings 7 and 8] are concentrated on the upper right side of the plot. Higher levels of citric acid and alcohol go with better quality wines. Average quality wines have lower levels of both alcohol and citric acid. Some of the blue and pink dots with low levels of alcohol nevertheless have high levels of citric acid; these wines appear to have been rated highly because of their high levels of citric acid.

Higher alcohol levels combined with sulphates levels below 1.16 go with wines of better quality. We can see from the above plots that alcohol has a large influence on the quality of wines.

We can see that there is no clear-cut pattern in the above scatter plot.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

  • Lower volatile acidity and higher alcohol produce better quality wines.
  • Slightly higher volatile acidity and lower levels of alcohol produce average quality wines.
  • Higher levels of citric acid and alcohol produce better quality wines.

Were there any interesting or surprising interactions between features?

Yes, there are some surprising relationships. The better quality wines that have low levels of alcohol have high levels of citric acid. The freshness that citric acid adds may be compensating for the low levels of alcohol.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

No models were created from the given dataset.


Final Plots and Summary

Plot One

Description One

Most of the wines are of average quality. Around 83% of the wines are of average quality [ratings 5 and 6], around 4% are of the worst quality [ratings 3 and 4], and around 13% are of better quality [ratings 7 and 8]. There are few wines of the best and worst quality. The analysis would have been easier if the wines had been distributed equally across all quality ratings.
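The percentages above come from grouping the quality ratings into three bands and dividing each band's count by the total. A small Python sketch of that arithmetic follows; the function name `quality_shares` and the sample ratings are made up for illustration, while the band boundaries follow the ratings quoted in the text.

```python
from collections import Counter


def quality_band(rating):
    """Band boundaries as quoted in the text: 3-4 worst, 5-6 average, 7-8 better."""
    if rating <= 4:
        return "worst"
    if rating <= 6:
        return "average"
    return "better"


def quality_shares(ratings):
    """Return the percentage of ratings that fall in each quality band."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(quality_band(r) for r in ratings)
    total = len(ratings)
    return {band: 100 * counts[band] / total for band in ("worst", "average", "better")}


# Made-up ratings; NOT the actual distribution of the wine data set.
sample_ratings = [3, 5, 5, 6, 6, 6, 7, 8, 5, 6]
for band, share in quality_shares(sample_ratings).items():
    print(band, share, "%")
```

Applied to the full quality column, the same calculation would yield the three shares reported in the description.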

Plot Two

Description Two

Box plots have been useful in understanding the features that affect the quality of wines. Alcohol seems to have a major impact on quality. We can clearly see from the box plot that wine quality improves as the alcohol level increases.

Plot Three

Description Three

Volatile acidity and alcohol are the two most important features of this data set. We can see a concentration of blue and green dots towards the lower right side of the plot. This indicates that lower volatile acidity and higher alcohol make for better quality wines.


Reflection

I carried out exploratory data analysis on the red wine data set. This data set has 1599 observations and 12 variables. There is one outcome variable, the quality of wines, and the other 11 are input/predictor variables.

In the univariate analysis I started by investigating individual variables. Most of the wines are of average quality. Most of the features are positively skewed and have long-tailed distributions.

In the bivariate analysis we saw that alcohol, volatile acidity, sulphates, chlorides, pH and citric acid have a major influence on the quality of wines. Fixed acidity, total acidity, residual sugar, free sulfur dioxide, and total sulfur dioxide have no major influence on the quality of wines.

The data set has 1599 observations, but they are not equally distributed across the quality ratings. There are more average quality wines than better or worse quality wines. Differentiating bad wines from better ones would have been easier if the quality ratings had been equally distributed.

In future, the cost of wines could be included. It would be interesting to see how the quality of a wine relates to its price. A designed experiment could also be carried out by fixing some of the input or predictor variables and varying one or two others to understand their impact on the quality of wines.